A feature importance strategy that identifies the attributes with the greatest impact on software
development effort estimation (SDEE) can reduce the dimensionality of the dataset.
SDEE models estimate the effort, time, and cost required to complete a software product
on a limited budget, and project managers increasingly use them as decision-support tools.
Effort estimation algorithms are more accurate when they are trained on a dataset that
contains only the essential features. Earlier research created and tested various individual
estimation methods to obtain accurate estimates; ensembles, however, produce better
prediction accuracy than single approaches. This study therefore aims to identify, develop,
and deploy an ensemble approach that is feasible and practical for forecasting software
development effort with limited time and minimum effort. The paper proposes a multi-level
ensemble approach. The first level selects the optimal features by adopting boosting
techniques that rank each feature's impact on the chosen target; this feature subset is
forwarded to the second level, a stacked ensemble that estimates the product development
effort with respect to lines of code (LOC) and actual cost. The proposed model yields higher
accuracy than the individual models.

1. INTRODUCTION

One of the most significant tasks in the software industry is to develop a quality software product
with minimal resources, which depends on how accurately the software development effort is estimated [1]. The
challenge is to evaluate effort metrics early in the project lifecycle, when the limits of each effort must be
determined and there are significant uncertainties about the end product's functionality. Effort estimation has been
defined as "estimating the effort and time required to develop a software product". The accuracy of its effort estimates
largely determines the success of any software product. Kumar et al. [2] demonstrate that the reasons for
software product failure are unrealistic or poorly articulated project goals, erroneous resource estimates, and an inability
to handle product complexity. Accurate effort estimates are therefore critical for project success. The authors of [3]-[5] define
a reasonable estimate as one that provides a clear enough view of the product reality to allow project management
to make sound decisions about steering the product to meet its objectives.

Various types of software effort estimation (SEE) approaches have been presented [6]. Among the
suggested techniques, machine learning (ML) based effort estimators such as support vector machines (SVM),
decision tree function (DTF) networks, and random forest trees (RFTs) have drawn the most attention [7].
Such methods are preferred because they make few or no assumptions about the function being modelled
and the training data. For instance, Rao and Rao [8] demonstrated that ensemble techniques
outperform single classification models in SEE because the voting classifier in the ensemble model reduces any
residual effect related to feature insignificance and redundancy. To achieve this, higher weights are given to
the classifiers that excel on the tested datasets. The resulting robustness to irrelevant and redundant
features improves prediction performance. In its simplest form, averaging, the voting model
reduces noise, which improves the overall prediction performance.
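
The weighting and averaging idea described above can be illustrated with a minimal sketch. The three probability vectors and the weights are hypothetical values chosen for illustration and are not results from [8]; the function names `weighted_vote` and `predict` are likewise assumptions.

```python
def weighted_vote(probabilities, weights):
    """Combine per-classifier class probabilities by weighted averaging."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight per classifier is required")
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


def predict(probabilities, weights):
    """Return the index of the class with the highest combined probability."""
    combined = weighted_vote(probabilities, weights)
    return max(range(len(combined)), key=combined.__getitem__)


# Hypothetical outputs of three classifiers for one project:
# each row is (P(class 0), P(class 1)).
probas = [[0.7, 0.3], [0.4, 0.6], [0.3, 0.7]]

equal_weights = predict(probas, [1, 1, 1])   # plain averaging
trusted_first = predict(probas, [3, 1, 1])   # first classifier weighted higher
```

With equal weights the two weaker votes for class 1 prevail; giving the first classifier a higher weight moves the combined decision to class 0, which is the mechanism by which a strong classifier can dominate the vote.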

This paper proposes a multilevel ensemble (MLE) learning module for software development
effort estimation. The proposed MLE system applies adaptive boosting and gradient tree boosting in the
first level and uses seven classifiers in the second: six individual classifiers and the proposed stacked ensemble.
As discussed later in this paper, the base classifiers were chosen following thorough simulation-based validation.
Some of these classifiers, including the RF and SVM models, have been considered in the literature [8]. The base
classifiers' diverse classification abilities also allow them to capture different statistical properties
of the underlying data, which adds value to the proposed ensemble learning approach.

The proposed ensemble model needs effective feature selection to perform well overall.
The final improvement depends on how well the redundant and unnecessary features in software
product datasets are handled. This research aims to show how feature importance improves effort estimation
performance and to propose a multilevel ensemble learning technique that is resistant to data imbalance and
feature redundancy. The improved robustness of the proposed MLE to redundant and irrelevant
features is a further significant contribution of this research.

This paper makes two contributions, as outlined above. The remainder of the paper is organised as
follows. Section 2 reviews the literature on estimation methods for software development effort.
Section 3 presents the proposed multilevel ensemble learning model and the experimental setup, including the
ML models whose best features are combined: neural networks (NN 30-30), linear regression (LR),
k-nearest neighbor (K-NN), SVM radial basis function (SVM RBF), naive Bayes (NB), and SVM polynomial
(SVM poly). Section 4 presents and discusses the results of the experiments. The research is concluded
in section 5, along with its future scope.


2. RELATED WORK 

This section summarises the ML strategies proposed for effort estimation, following a survey of general
effort estimation algorithms. It reviews various classification methods and methodologies and compares
approaches that can be used to estimate software development effort.


2.1. Single classifier for software development effort estimation 

Quality development has evolved into a critical activity for professional companies. Indeed, the prominence,
cost, and suitability of the developed software are frequently decisive elements in an organization's success. A
comprehensive analysis of ML techniques used for effort estimation was carried out in [9]. According to that
analysis, researchers mainly focused on customizing specific algorithms, particularly artificial neural
networks, case-based reasoning models, and decision trees, for the best performance. With a mean magnitude
of relative error (MMRE) ranging from 35 to 55%, a percentage of predictions within 25% of the actual value
(PRED(25)) of 47 to 75%, and a median magnitude of relative error (MdMRE) of 30 to 55%, the precision of the
ML models was of an acceptable level and outperformed statistical ones. According to the researchers, ML
algorithms may produce disparate findings due to outliers, missing values, and the risk of overfitting.
LR and NN have been used to estimate effort in the early stages of the software life cycle [10]. Shahpar et al. [11]
investigated several data sets and obtained encouraging findings for software development effort assessment.
For estimating software maintenance effort, Singh et al. [12] proposed a successful swarm intelligence-based
method using particle swarm optimization. Regardless of the approach used to develop ML models, valuable
recommendations for effort and duration estimation at early project stages can be obtained. Because ML is
sensitive to noise in data sets, models should not rely on a single algorithm; several should be employed in tandem,
which improves prediction accuracy [13]. Boosting, bagging, and complex random sampling techniques [14]
were proposed by researchers, generally for the same type of ML algorithm. However, if used excessively,
ensemble methods can cause significant performance overhead [15]. As a result, ML effort and duration models
should be developed with a limited selection of algorithms and a simple ensemble method, such as averaging
the obtained estimates.


2.2. Evolutionary strategy

A hybrid method to estimate effort using the use case point (UCP) methodology has been proposed in [16].
Numerous observations were made based on college student projects and industrial projects. The authors
gave the environmental factors of the UCP approach significant weight. They used feed-forward algorithms
such as the radial basis feed-forward neural network (RBFNN) to predict effort and productivity.
The study concluded that the UCP method's environmental factors are well suited to forecasting
software system productivity.

More recently, [17]-[19] examined the application of ensembles of learning machines to SEE.
Ensembles of learning machines are groups of learners trained to complete the same task and combined
to enhance prediction performance [8]. It is generally accepted that the combined learners should behave
differently from one another to obtain more accurate predictions; otherwise, the combined forecast will be
no more accurate than the individual ones. Therefore, the various ensemble learning strategies can be viewed
as different ways to create diversity among the base learners. The authors tried to estimate effort with a low
failure ratio and cost.

None of these publications compares its outcomes with other readily available ensemble learning
methods from the ML literature or addresses the issues raised above. Researchers in [20], [21] provide data
from a few ensemble approaches. However, that research does not statistically compare these methods with
single learners. Different ensemble approaches can be more or less suitable for SEE and should be included in the
comparisons. The papers should also examine how the results were obtained.


2.3. Other approaches 

According to some research in the literature, the properties of the data set substantially affect
how well various models perform. However, as previously indicated, existing research on ensemble models
suggests that they perform better than individual models even when numerous different data sets are employed.
The results of the research methods presented in Table 1 (in Appendix) [18], [22]-[40] demonstrated that
the best and significantly improved SDEE performance was obtained by combining two
strategies. According to these studies, hybrid approaches may produce satisfactory results for datasets of
varying sizes.


3. RESEARCH METHODOLOGY 

The proposed research uses the MLE approach to select the optimal features in the first level and
then applies SDEE models to calculate effort in the second level. Figure 1 shows the overall
feature selection process and software development effort estimation. The suggested method efficiently
assigns ranks to features while simultaneously dealing with the imbalanced data problem in a software quality
dataset to avoid bias. The following feature selection phases are defined to meet this objective.


[Diagram. Labels recoverable from the figure: Dataset; D = {a_i, b_i}; Target = "Actual Cost"; Target = "LOC"; Stacked Ensemble; Performance Analysis; analysis with state of art.]

Figure 1. Prediction of SDEE using proposed MLE


3.1. Feature subset selection 
An effective feature selection system requires finding the primitive features that will be used to
train the models [41]. The relevance, or correlation, between a feature and the class label serves as the
core guiding principle of the selection process and is frequently applied to classification problems. Relevance
measurements can be used to assess the significance of features for dimensionality reduction, here with
five-fold cross-validation. The model seeks to identify the best feature subset $f$ with $|f| = k$ that maximises
classification accuracy for a given dataset $D = (a_i, b_i)$, with feature set $a$ and class label $b$. To this end, we
propose an ensemble of adaptive boosting [42] and gradient tree boosting [43] in which the base ensemble
classifiers are scored by the classification accuracy and AUC attributable to each feature, using up to
three permutations. This procedure, known as ensemble feature selection, aims to limit the
impact of high dimensionality on learning algorithms while preserving classifier accuracy and developing
successful ensemble learning systems for classification problems.

The general process of ensemble feature selection is shown in Figure 1. The fundamental idea is
to compute, as shown in (1), the loss of accuracy $\Psi$ attributable to each individual feature. The feature
subsets are divided randomly so that diversity across the chosen subsets is ensured, and the mean loss over all
classification methods is then computed using (2). This measures how consistently and how effectively each
of the 15 features of the dataset $D = (a_i, b_i)$ separates the examples under the proposed
classifier models.


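$$\Psi_{a_i}^{m_j} = \mathrm{ab}\left(m_j.\mathrm{accuracy}(D) - m_j.\mathrm{accuracy}(D - a_i)\right) \qquad (1)$$


In (1), $\Psi_{a_i}^{m_j}$ represents the loss of boosting model $m_j$ on attribute $a_i$, and $m_j.\mathrm{accuracy}(D)$ denotes the
accuracy measure of boosting model $m_j$.


$$\mathrm{mean}_{a_i} = \sum_{j} \Psi_{a_i}^{m_j} \qquad (2)$$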
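
A minimal sketch of the leave-one-feature-out loss in (1) and its aggregation in (2) might look as follows. It assumes that ab(.) denotes the absolute value and that (2) averages the loss over the boosting models. The trained models are replaced by a hypothetical scorer (`make_scorer`) whose per-feature contributions are invented purely for illustration.

```python
def feature_loss(score, features, dropped):
    """Eq. (1) sketch: change in accuracy when one feature is left out.

    `score` maps a list of feature names to an accuracy in [0, 1].
    Reading ab(.) as the absolute value is an assumption.
    """
    kept = [f for f in features if f != dropped]
    return abs(score(features) - score(kept))


def make_scorer(contributions, base=0.5):
    """Hypothetical stand-in for m_j.accuracy(.): each feature adds a fixed gain."""
    def score(features):
        return base + sum(contributions.get(f, 0.0) for f in features)
    return score


features = ["rely", "data", "virt"]
# Invented per-feature gains for two boosting models (not values from the paper).
ada = make_scorer({"rely": 0.10, "data": 0.20, "virt": 0.00})
gtb = make_scorer({"rely": 0.05, "data": 0.25, "virt": 0.01})
models = (ada, gtb)

# Eq. (2) sketch: aggregate the loss of each feature over all boosting models.
mean_loss = {
    f: sum(feature_loss(m, features, f) for m in models) / len(models)
    for f in features
}
ranking = sorted(features, key=mean_loss.get, reverse=True)
```

Features whose removal barely changes accuracy (here `virt`) end up at the bottom of the ranking and become candidates for removal.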


The proposed approach uses two ensemble methods. Each ensemble assigns a class label to every
instance of the dataset, and the label chosen most frequently across both methods is taken as the final
class label. Algorithms 1 and 2 support this first phase.


Algorithm 1. Pseudo code of Ada-Boost for feature selection


Input:
    Training dataset $D = \{(a_i, b_i)\}$ where $a_i \in \mathbb{R}^d$ and $b_i \in \{-1, +1\}$
    Set of $d$ features $F_a = \{a_1, a_2, \ldots, a_d\}$
Output:
    Optimal features based on ranks $R = (D, \{f_{a_1}, f_{a_2}, \ldots, f_{a_d}\})$
Begin:
    1: Initialize weights $w_{1,i}$ for $b_i = 0, 1$ respectively; for $k = 1, 2, \ldots, K$
    2: Normalize weights $w_{k,i} = \dfrac{w_{k,i}}{\sum_{j=1}^{n} w_{k,j}}$
    3: For each feature $j$, train a base classifier $h_j$ that is restricted to using a single feature.
    4: Calculate the error $\epsilon_j = \sum_i w_i \, |h_j(a_i) - b_i|$
    5: Choose the classifier $h_k$ with the lowest error $\epsilon_k$
    6: Revise weights $w_{k+1,i} = w_{k,i} \, \beta_k^{1 - x_i}$
    7: Repeat until the final classifier is obtained:
       $h(x) = 1$ if $\sum_k \alpha_k h_k(x) \ge \frac{1}{2} \sum_k \alpha_k$, and $0$ otherwise, where $\alpha_k = \log \frac{1}{\beta_k}$
    8: end for
    The resulting model outputs are used as the final forecast for test cases.
    Note: Ranking assigned from 1 to 10.
End
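
The boosting loop of Algorithm 1 can be sketched with one threshold rule ("stump") per feature. Several details below are assumptions rather than the authors' implementation: labels are taken as 0/1, each class starts with half of the total weight, $\beta_k = \epsilon_k / (1 - \epsilon_k)$ as in standard AdaBoost, only correctly classified examples are down-weighted, and features are ranked by the total vote $\alpha_k$ their stumps receive. Both classes must be present in `y`.

```python
import math


def best_stump(column, labels, weights):
    """Find the threshold rule on one feature with the lowest weighted error."""
    best = (float("inf"), None, 1)
    for threshold in sorted(set(column)):
        for polarity in (1, -1):
            error = sum(
                w * abs((1 if polarity * x >= polarity * threshold else 0) - y)
                for x, y, w in zip(column, labels, weights)
            )
            if error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost_feature_ranking(X, y, rounds=5):
    """Rank feature indices by the total vote (alpha) given to their stumps."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    # Assumed initialisation: each class starts with half of the total weight.
    weights = [1 / (2 * n_pos) if b == 1 else 1 / (2 * n_neg) for b in y]
    votes = {}
    for _ in range(rounds):
        total = sum(weights)
        weights = [w / total for w in weights]            # step 2: normalise
        stumps = []
        for j in range(len(X[0])):                        # step 3: one stump per feature
            column = [row[j] for row in X]
            error, threshold, polarity = best_stump(column, y, weights)  # step 4
            stumps.append((error, j, threshold, polarity))
        error, j, threshold, polarity = min(stumps)       # step 5: lowest error
        error = min(max(error, 1e-10), 1 - 1e-10)         # keep beta finite
        beta = error / (1 - error)
        votes[j] = votes.get(j, 0.0) + math.log(1 / beta)  # alpha_k
        for i, row in enumerate(X):                       # step 6: reweight
            predicted = 1 if polarity * row[j] >= polarity * threshold else 0
            if predicted == y[i]:
                weights[i] *= beta
    return sorted(votes, key=votes.get, reverse=True)


# Toy data: feature 1 separates the classes, features 0 and 2 do not.
X = [[5, 1.0, 0], [3, 2.0, 1], [4, 1.5, 0], [5, 8.0, 1], [3, 9.0, 0], [4, 7.5, 1]]
y = [0, 0, 0, 1, 1, 1]
ranking = adaboost_feature_ranking(X, y, rounds=3)
```

Because every round selects exactly one single-feature classifier, the sequence of selected stumps doubles as a feature ranking, which is why this style of boosting is convenient for feature selection.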


Algorithm 2. Pseudo code of gradient boost feature selection


Input:
    Training set $D = \{(a_i, b_i)\}$ where $a_i \in \mathbb{R}^d$ and $b_i \in \{0, 1\}$
    Set of $d$ features $F_a = \{a_1, a_2, \ldots, a_d\}$
Output:
    Rankings of features $R = (D, \{f_{a_1}, f_{a_2}, \ldots, f_{a_d}\})$
Begin:
    1: $\{h_1, h_2, \ldots, h_M\} \leftarrow$ train GBT
    2: $f \leftarrow [0, \ldots, 0]$
    3: for each $h_p$ in $\{h_1, h_2, \ldots, h_M\}$ do
    4:     for $j = 1$ to $d$ do
    5:         update $f_j$ by adding the scaled importance of feature $j$ in $h_p$
    6:     end for
    7: end for
    8: for $i = 1$ to $n$ do
    9:     keep in $a_i$ only the features $a_{ij}$ whose importance $f_j$ reaches the threshold $th \in (0, 1)$
    Return $\{(a_i, b_i), i = 1, 2, \ldots, n\}$
    10: The dataset D is returned preserving only the selected features
    The resulting model outputs feature importances according to weights.
    Note: Ranking assigned from 1 to 10.
End
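
The accumulate-then-threshold pattern of Algorithm 2 can be sketched as below. The sketch assumes that each tree of the trained gradient boosting model exposes one importance value per feature and that the per-tree values are averaged; the function name `select_features`, the importance numbers, and the threshold are hypothetical.

```python
def select_features(X, feature_names, tree_importances, threshold):
    """Average per-tree importances and keep features at or above the threshold."""
    d = len(feature_names)
    n_trees = len(tree_importances)
    importance = [0.0] * d
    for per_tree in tree_importances:          # one importance vector per tree h_p
        for j in range(d):
            importance[j] += per_tree[j] / n_trees
    keep = [j for j in range(d) if importance[j] >= threshold]
    reduced = [[row[j] for j in keep] for row in X]   # dataset with selected columns only
    return [feature_names[j] for j in keep], reduced, importance


names = ["rely", "virt", "data"]
X = [[1, 2, 3], [4, 5, 6]]
# Invented importances for two trees (not values from the paper).
per_tree = [[0.5, 0.0, 0.5], [0.3, 0.1, 0.6]]
kept, reduced, importance = select_features(X, names, per_tree, threshold=0.1)
```

The returned dataset keeps the original rows but only the columns whose averaged importance reaches the threshold, matching the final step of the algorithm.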


To evaluate the proposed approach, this research used the COCOMO-81 dataset, shown in Table 2.
Weighted features are selected by rating their classification accuracy and AUC value
in an initial model that includes all predictors. The gradient tree boost model uses a greedy optimisation
strategy to identify the top-performing subset of features alongside the proposed Ada-Boost [35]. The dataset
in Table 2 contains 17 features: F1 to F15 are the contributing features, F16 is the dependent target "LOC",
and F17 is the second target, "actual cost".


Table 2. Dataset information for COCOMO-81

Feature  Description of feature          Code    Value      Std       Value       Std
F1       Required software reliability   rely    0.01131    0.002986  0.00267857  0.00087617
F2       Data base size                  data    0.027381   0.002393  0.100645    0.0172055
F3       Process complexity              cplx    0.001091   0.000982  0.0168651   0.00505271
F4       Time constraint for cpu modern  time    0.009921   0.00271   0.126438    0.0119631
F5       Main memory constraint          stor    0.00377    0.001964  0.0274802   0.0165147
F6       Machine volatility              virt    0.000794   0.00014   0.00124008  0.00059936
F7       Turnaround time                 turn    0.003373   0.000612  0.00763889  0.00092
F8       Analysts capability             acap    0.001687   0.001148  0.003125    0.00107994
F9       Application experience          aexp    0.028373   0.00231   0.0198413   0.00546806
F10      Programmers capability          pcap    0.002679   0.000643  0.00128968  0.00101171
F11      Virtual machine experience      vexp    -9.92E-05  0.00014   1.98E-04    0.0001403
F12      Language experience             lexp    0.000198   0.00014   0.0014881   0.00084179
F13      Programming practices           modp    0.000595   0.000643  0.018502    0.00633835
F14      Use of software tools           tool    0.001984   0.000982  0.00138889  0.0002806
F15      Schedule constraint             sced    0.000496   0.000506  0.00530754  0.00401939
F16      Lines of code (target)          LOC
F17      Actual cost (target)            Actual


From the above, a sufficient number of features was collected based on their ranks with respect to the two targets.
These feature sets are then forwarded to the seven classifiers, and the loss $\Psi$ of each one is calculated as shown in (3):


$$\Psi_{f_i}^{M_n} = \mathrm{ab}\left(M_n.\mathrm{acc}(T_i)\right) \qquad (3)$$


In (3), $\Psi_{f_i}^{M_n}$ represents the loss of classifier $M_n$ on selected feature $f_i$, and $M_n.\mathrm{acc}(T_i)$ denotes the
accuracy measure of classifier $M_n$. The mean loss is calculated over the three permutations for each individual classifier.
When working on a specific learning set, the stacked model can be thought of as a method of calculating all
base classifier losses $\sum_{n=1}^{N} \Psi$ and then correcting prediction residuals using the level 1 model. The mean of
all accuracy losses is derived using (4):


$$\mathrm{Mean}_{f_i}^{M} = \sum_{n=1}^{N} \Psi_{f_i}^{M_n} \qquad (4)$$


In (4), $\mathrm{Mean}_{f_i}^{M}$ represents the mean loss on ranked feature $f_i$ over all
classifiers. The overall accuracy is produced in the order in which optimal features are selected based on the
ranking. The ensemble classifier was used to choose the top features for inclusion in the
classification model based on the analysed feature ranking.


3.2. Experimental setup and simulation 

The proposed research was executed on a system with an Intel(R) 6200 CPU at 3.40 GHz,
8 GB of RAM, and 64-bit Windows 10. The simulation was implemented in Python using the open-source
Anaconda distribution and the Spyder IDE. As listed in Table 3, all the classifiers' parameters were
selected by trial and error.

Table 3. Parameter setup


Base models                      Parameter setup
KNN                              {Number of neighbours (k): 5}
Naive Bayes                      {No hyperparameters to tune}
SVM poly                         {C (regularization parameter): 1.0}; {kernel: polynomial}; {degree of polynomial kernel: 3}
SVM RBF                          {C (regularization parameter): 1.0}; {kernel: RBF}
NN 30-30                         {Neurons per layer: 30}; {activation function: relu for hidden layers, softmax for output layer}; {learning rate: 0.001}; {training duration: 100 epochs}
LR                               {Regularization type: l2 (ridge)}; {regularization parameter (C): 1.0}; {training duration: 100 epochs}
Proposed stacked ensemble model  {Base models: NN 30-30 classifier, SVM RBF, NB, SVM poly, LR}; {Meta learner: logistic regression}; {Number of base models: 6}; {Hyperparameters for base models and meta learner tuned during stacking}


4. RESULT ANALYSIS 

As the first phase results in Table 2 show, optimal features were identified and ranked using the two
boosting models. Both models identified largely the same features, which were used to prepare
an optimal dataset of the top-ranked features, for which the mean loss and rank were then calculated.
Figure 2 shows an analysis of the optimal dataset with ten features, ranked by
applying GT Boosting classification accuracy (CA) and AUC score. Figure 3 shows the classification accuracy
and AUC of the Ada-Boost classifier for the optimal features, with ranks calculated for the top ten features.
Both figures show the positions of the top ten of the fifteen features
based on their scores.


[Bar chart; horizontal axis: decrease in AUC, 0 to 0.03.]

Figure 2. Gradient tree boosting classifier feature ranking


[Bar chart of features including stor, aexp, modp, turn, sced, acap, and rely; horizontal axis: decrease in AUC, 0 to 0.14.]

Figure 3. Ada-Boosting classifier feature ranking


Following Algorithms 1 and 2, the CA and AUC scores are calculated, and each boosting algorithm assigns
ranks as shown in Table 4. Positions are given to all fifteen features: $F_{rely}$, $F_{data}$, $F_{cplx}$,
$F_{time}$, $F_{stor}$, $F_{virt}$, $F_{turn}$, $F_{acap}$, $F_{aexp}$, $F_{pcap}$, $F_{vexp}$, $F_{lexp}$, $F_{modp}$, $F_{tool}$, and $F_{sced}$. Of the fifteen features, GT
Boosting CA is very low for $F_{virt}$, $F_{vexp}$, $F_{lexp}$, $F_{modp}$, and $F_{sced}$, and Ada-Boost CA is very low for
$F_{virt}$, $F_{pcap}$, $F_{vexp}$, $F_{lexp}$, and $F_{tool}$. The features that receive low accuracies from both
algorithms ($F_{virt}$, $F_{vexp}$, and $F_{lexp}$) are removed from the original dataset, and the resulting subset of
twelve features is forwarded to the next-level ensemble approach.
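
The pruning rule in this paragraph amounts to a set intersection, which the following minimal sketch makes explicit. The feature codes follow Table 2 and the two low-accuracy sets are those named in the text; the variable names are illustrative assumptions.

```python
# Feature codes F1..F15 in the order used by Table 2.
FEATURES = ["rely", "data", "cplx", "time", "stor", "virt", "turn", "acap",
            "aexp", "pcap", "vexp", "lexp", "modp", "tool", "sced"]

# Features with very low classification accuracy under each boosting model.
low_gtb = {"virt", "vexp", "lexp", "modp", "sced"}
low_ada = {"virt", "pcap", "vexp", "lexp", "tool"}

# Only features that both models rate poorly are dropped.
dropped = low_gtb & low_ada
optimal = [f for f in FEATURES if f not in dropped]
optimal_ids = [FEATURES.index(f) + 1 for f in optimal]
```

Requiring agreement between both rankings is the conservative choice: a feature such as `modp`, which only one model rates poorly, stays in the dataset.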


Table 4. Rankings of features for predicting boosting classifiers

Optimal features based on their ranks

Adaptive boost classifier:
    [F1, F8, F15, F7, F3, F13, F9, F2, F5, F4]
    required software reliability (rely), analysts capability (acap), schedule constraint (sced), turnaround time (turn),
    process complexity (cplx), programming practices (modp), application experience (aexp), data base size (data),
    main memory constraint (stor), time constraint for cpu modern (time).

GT boost classifier:
    [F3, F8, F10, F14, F7, F5, F9, F1, F2, F4]
    process complexity (cplx), analysts capability (acap), programmers capability (pcap), use of software tools (tool),
    turnaround time (turn), main memory constraint (stor), application experience (aexp), required software reliability (rely),
    data base size (data), time constraint for cpu modern (time).

New optimal subset features (total 12): [F1, F2, F3, F4, F5, F7, F8, F9, F10, F13, F14, F15]


4.1. Discussion on level 1 results 

Following the proposed MLE model, experiments were first conducted on the original dataset with
all six classifiers (NN 30-30, NB, SVM RBF, SVM poly, K-NN, and LR), using
LOC and actual cost as targets. The CA, AUC, F1, precision, and recall were calculated for every classifier,
and the outcomes are shown in Table 5. Most models performed relatively well. K-NN and NB achieved the
highest AUC values, 0.993 and 0.973 respectively, better than all other models, given that the objective
was to predict the actual effort and LOC in the target dataset for developing the SDEE. SVM poly achieved
the best CA (0.937) with a precision of 0.955 and a low error rate, whereas the NN 30-30 classifier was the
weakest model, with a CA of 0.476 and a precision of 0.545.


Table 5. Classifier performance observed with 15 features
Performance measures with 15 features


S.No  Base model           AUC    CA     F1-score  Precision  Recall
1.    NN 30-30 classifier  0.831  0.476  0.438     0.545      0.476
2.    NB                   0.973  0.667  0.694     0.846      0.667
3.    SVM RBF              0.541  0.825  0.833     0.851      0.825
4.    K-NN (EQUL)          0.993  0.873  0.870     0.883      0.873
5.    SVM poly             0.651  0.937  0.939     0.955      0.937
6.    LR                   0.906  0.873  0.873     0.908      0.873
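
In Table 5 the recall column equals CA in every row. This is what one obtains when per-class recall is averaged with weights proportional to class frequency; whether the reported values were computed this way is an assumption. The sketch below, with invented labels and a hypothetical function name, shows the identity.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label in sorted(support):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        prec = tp / predicted if predicted else 0.0
        rec = tp / support[label]
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = support[label] / n          # class frequency
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return {"CA": accuracy, "precision": precision, "recall": recall, "F1": f1}


# Invented effort classes for six projects.
y_true = ["low", "low", "low", "mid", "mid", "high"]
y_pred = ["low", "low", "mid", "mid", "high", "high"]
scores = weighted_scores(y_true, y_pred)
```

Because the weights cancel the per-class denominators, weighted recall reduces to the fraction of correctly classified instances, so it adds no information beyond CA; precision and F1 remain informative.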


The ROC curve in Figure 4 shows the relationship between the true positive rate (TPR), which
measures model sensitivity, and the false positive rate (FPR), which equals one minus the model specificity,
for the original dataset with fifteen features and with LOC and actual cost as targets (the two targets are
proportional).
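
For reference, the ROC points and the area under the curve can be computed as in the following minimal sketch. It assumes a binary problem with labels 0 and 1 and at least one example of each class; the labels and scores are invented for illustration, and a multi-class setting would require a one-versus-rest extension.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
area = auc(points)
```

The area equals the fraction of positive-negative pairs that the scores order correctly (8 of 9 pairs here), which is why a ROC AUC is bounded by 1.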

According to the ROC curve, NB performs better than the remaining individual classifiers, each of
which is shown in its own colour in Figure 4. A lift curve analysis helps evaluate the performance of
different models, especially in classification tasks. Figure 5 shows that the K-NN model achieves an
exceptionally high area under the lift curve of 4.65 at a probability threshold of 0.0, indicating that it can
make highly accurate predictions when selecting the nearest neighbors. The SVM model with a polynomial
kernel has a much lower area of 0.508 at a probability threshold of 0.168, which suggests that it may not
perform as well.
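
A lift curve can exceed 1, unlike a ROC curve, which is why the areas reported for Figure 5 are larger than 1. The sketch below shows one common definition of lift (precision among the top-ranked cases divided by the overall positive rate); how the plotting tool used in this study computes its area is not stated, so this definition and the invented data are assumptions.

```python
def lift_curve(y_true, scores):
    """Return (fraction targeted, lift) points, highest-scored cases first."""
    n = len(y_true)
    base_rate = sum(y_true) / n
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    points = []
    hits = 0
    for rank, i in enumerate(order, start=1):
        hits += y_true[i]
        # Lift: positive rate among the top `rank` cases relative to the base rate.
        points.append((rank / n, (hits / rank) / base_rate))
    return points


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
lift = lift_curve(y_true, scores)
```

The curve starts at the maximum attainable lift when the top-ranked cases are all positive and always ends at 1 once every case is targeted.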


4.2. Discussion on stacking ensemble learning approach at level 2 

In the second level of the proposed approach, we used the stacked ensemble, an effective ensemble
learning algorithm that learns how to combine predictions from two or more base ML techniques.
The advantage of stacking is that it can combine the abilities of six high-performing models on a
classification or regression task to provide forecasts that perform better than any single model in the
ensemble [44]. Using this method, the outputs of multiple models are integrated to create a single,
effective output. The method uses level 1 base models, which are fitted to the training data and generate
predictions, and a level 2 meta-model that learns how best to combine the base models' predictions. The
outputs of the base learners can be integrated using a weighted average. This gives the model strong
prediction performance as well as reliability. The high-impact features selected by Ada-Boost and GTB are
listed in Table 4; a new dataset containing this optimal feature subset was prepared separately and given as
input to all classifiers, including the proposed stacked ensemble, for the experiments. Actual cost and LOC
are the target variables (the two are proportional), and the accuracies of all models, including the
proposed one, are shown in Table 6. Model performance was evaluated with the same metrics as in level 1.
The evaluation focuses on assessing how well the new optimal feature dataset predicts the
development effort.
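
The two-level mechanism can be sketched in a few lines. The meta-learner is a logistic regression, as in Table 3, but everything else is a simplification assumed for illustration: the base learners are hypothetical single-feature threshold rules standing in for the six classifiers, the folds are contiguous, every training fold must contain both classes, and the data are invented.

```python
import math


def k_fold(n, k):
    """Yield (train indices, test indices) for k contiguous folds."""
    for f in range(k):
        test = list(range(f * n // k, (f + 1) * n // k))
        held_out = set(test)
        yield [i for i in range(n) if i not in held_out], test


def fit_threshold(X, y, feature):
    """Hypothetical base learner: cut one feature midway between the class means."""
    pos = [row[feature] for row, b in zip(X, y) if b == 1]
    neg = [row[feature] for row, b in zip(X, y) if b == 0]
    pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
    cut = (pos_mean + neg_mean) / 2
    sign = 1 if pos_mean >= neg_mean else -1
    return lambda row: 1.0 if sign * (row[feature] - cut) > 0 else 0.0


def fit_logistic(Z, y, epochs=500, lr=0.5):
    """Level 2 meta-learner: logistic regression trained by gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        for z, label in zip(Z, y):
            p = 1 / (1 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
            g = p - label
            w = [wi - lr * g * zi for wi, zi in zip(w, z)]
            b -= lr * g
    return lambda z: 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else 0


def fit_stack(X, y, base_features, k=3):
    """Train base learners out-of-fold, then fit the meta-learner on their outputs."""
    Z = [[0.0] * len(base_features) for _ in X]
    for train, test in k_fold(len(X), k):
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        for m, feature in enumerate(base_features):
            model = fit_threshold(X_train, y_train, feature)
            for i in test:
                Z[i][m] = model(X[i])      # prediction for a row the model never saw
    meta = fit_logistic(Z, y)
    bases = [fit_threshold(X, y, f) for f in base_features]   # refit on all data
    return lambda row: meta([base(row) for base in bases])


# Invented data: feature 0 is informative, feature 1 is noise.
X = [[1, 5], [8, 5], [2, 4], [9, 6], [1, 6], [7, 4],
     [3, 5], [8, 4], [2, 6], [9, 5], [1, 4], [7, 6]]
y = [0, 1] * 6
stack = fit_stack(X, y, base_features=[0, 1], k=3)
```

Training the meta-learner on out-of-fold predictions keeps it from rewarding base learners that merely memorise the training rows, and it lets the meta-learner learn to rely on the informative base learner and discount the noisy one.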


[ROC plot; vertical axis: TP rate (sensitivity); horizontal axis: FP rate (1-specificity), 0 to 1; curves: LR-1, SVM(RBF), SVM(POLY), Naive Bayes, NN 3030, KNN(EQUL).]

Figure 4. Performance of all models on fifteen features


[Lift curve plot. Area under the curve: LR-1: 1.834; SVM(RBF): 1.961; SVM(POLY): 0.508; Naive Bayes: 2.742; NN 3030: 2.167; KNN(EQUL): 4.65.]

Figure 5. Lift curve of all models on original dataset


Table 6. Performance of ensemble model in comparison to that of classifiers for optimal features
Performance measures with 12 features


S.No  Base model           AUC    CA     F1-score  Precision  Recall
1.    NN 30-30 classifier  0.841  0.492  0.481     0.580      0.492
2.    Naive Bayes          0.977  0.667  0.692     0.781      0.667
3.    SVM RBF              0.606  0.810  0.812     0.833      0.810
4.    K-NN (EQUL)          0.983  0.873  0.870     0.883      0.873
5.    SVM poly             0.965  0.921  0.925     0.947      0.921
6.    LR                   0.851  0.794  0.791     0.807      0.794
7.    Stacked ensemble     0.989  0.990  0.990     0.991      0.996


In the experiments with the proposed model on twelve features, the individual
classifiers performed broadly in line with their level 1 results. The proposed stacked ensemble achieved
an AUC of 0.989 and K-NN an AUC of 0.983, better than all other models, given that the objective of the
process was to predict the actual effort and LOC in the target dataset for developing the SDEE. The proposed
stacked ensemble reached 99% CA and 99% precision, with the lowest error in predicting the target.

The ROC curve in Figure 6 shows the relationship between TPR, which measures model
sensitivity, and FPR, which equals one minus the model specificity, for the optimal feature subset with
twelve features and with LOC and actual cost as targets (the two are proportional).
According to the ROC curve, the proposed stacked ensemble approaches 1 and performs better than all
remaining individual classifiers, each of which is shown in its own colour in Figure 6.


[ROC plot; vertical axis: TP rate (sensitivity); horizontal axis: FP rate (1-specificity), 0 to 1; curves: LR-1, SVM(RBF), SVM(POLY), Naive Bayes, NN 3030, KNN(EQUL), Stack.]

Figure 6. Performance of all models on optimal feature subset dataset


The probability threshold of the proposed stacked ensemble classifier for the optimal feature
subset dataset (12 features) was also calculated to find the error in each effort category. Figure 7 shows the lift
curve analysis for each model, including the stacked ensemble, with LOC and Actual as targets. The
proposed model achieves a high area under the lift curve of 2.783 at a probability threshold of 0.025, which
suggests that it makes accurate predictions compared with most of the individual models and with
state-of-the-art models.


[Lift curve plot. Area under the curve: LR-1: 2.34; SVM(RBF): 2.596; SVM(POLY): 2.397; Naive Bayes: 2.726; NN 3030: 2.089; KNN(EQUL): 4.65; Stack: 2.783. Probability thresholds shown include 0.158 and 0.117.]

Figure 7. Lift curve of all models on optimal feature subset dataset

Compared to the feature ranking technique alone, the accuracy obtained by stacked ensemble learning,
shown alongside all classifiers in Table 6, is very promising. This is partly due to the stacked
ensemble learning algorithm's dedicated targeting of the top features based on ranking. For the optimal dataset,
Figure 8 displays the classification accuracy and AUC values for the base learners and the proposed stacked
ensemble approach. The proposed stacking ensemble produced the highest accuracy in predicting
SDEE, which meets this research study's objective.


Figure 8. Model analysis of proposed model on optimal feature subset dataset 


5. CONCLUSION 

This research examined ways to estimate software development effort in minimum time. We
observed that using a few high-importance features can yield a greater AUC than using all features. In
addition, NB and K-NN outperformed techniques such as NN 30-30, SVMs (poly, RBF), and LR on the
COCOMO dataset in terms of AUC, and the results indicate that weighting features may improve SDEE
when employing individual classifiers. We therefore proposed a multi-level ensemble model that ranks
features by priority and estimates development effort with a stacked ensemble of six well-designed
learners, which achieved higher accuracy and AUC than the more
traditional techniques NN 30-30, SVMs (poly, RBF), NB, K-NN, and LR. According to the No Free Lunch
theorem, no single ML classifier is the best in every situation, and our work with different ensembles
likewise suggests that it is improbable that one model is always the best. The software product manager
should ideally test several models while employing a guiding framework that considers all the goals and
projects at hand. This makes it possible to pinpoint the model that delivers the behaviour best suited to the
manager's requirements. In future, we plan to use further feature selection approaches to support our claim
that many features in publicly available software product datasets are unnecessary or redundant.
Investigating additional ensemble learners for comparison with our system will also be part of future work.


APPENDIX 
Table 1. Research on software effort estimation by adopting various approaches
S.No | Techniques used | Data-set used | Problem name | State of art | Metrics used for study | Ref.
1 | RSA | USP05-FT, USP05-RQ | Feature reduction | FFNN, NB | MMRE, RMSE, MAE | [22]
2 | DTF | ISBSG, Desharnais | SEE | DT, MLR | MRE, MMRE, MdMRE, PRED | [23]
3 | COCOMO | NASA 93 | SDEE | NB, LR, RF | AUC, CA, Precision, Recall | [24]
4 | ANN | COCOMO II | Minimize predetermined error | - | MMRE, MSE | [25]

Feature importance for software development effort estimation using ulti level ensemble ... (K. Eswara Rao) 


1100 O ISSN: 2302-9285 
5 | GLM | ISBSG | SEE | SVM, MLP | MAE, RMSE, MMER, etc. | [26]
6 | RF | ISBSG, COCOMO | SEE | - | PRED, MRE, MMRE | [18]
7 | GA, PSO, FL, ACO, ABC | - | Predict reliability | - | - | [27]
8 | SEER-SEM | COCOMO | FPA | - | MMRE, PRED | [28]
9 | OLS, SWR, RR | COCOMO, MAXWELL, CHINA | Regression based effort estimation | ML | MAE, BMMRE | [29]
10 | ABEO-KN | Promise Repository datasets | Ranking of estimation methods | Analog based methods | MMRE, MAR, MdAR, SD, RSD, LSD | [30]
11 | ASEE | Desharnais, ISBSG, Albrecht, COCOMO, Kemerer | SDEE | Analog based SDEE | MMRE, PRED, MdMRE, MRE | [31]
12 | ANN | COCOMO | Estimating effort | - | MMRE, PRED, RMSE | [32]
13 | Classical analogy, ensemble models | ISBSG | SEE | Fuzzy analogy | MAE, LSD, MBRE, MIBRE | [33]
14 | GP, MOGP | Desharnais, Finnish, Miyazaki | Accuracy | - | MMRE, PRED, MdEMRE | [34]
15 | Multi layered feed forward neural network (MLFFANN) | COCOMO II | Prediction of software effort | - | MSE, MMRE | [35]
16 | Fuzzy logic | - | SCE | Bailey Basili, Doty, Halstead | MRE, MF, MMRE | [36]
17 | ABE | ISBSG | Predict SEE | CART, MLR, CNN | MRE, MMRE, PRED | [37]
18 | Ada Boost | Desharnais, MAXWELL | LOC, actual cost | K-NN, SVM | Loss, Accuracy | [38]
19 | ML, CBR | Infoway, Diyatech, Tsoft | SDE | - | BRE, MRE | [39]
20 | Metaheuristic optimization | NASA | - | GA, PSO, FA | MAE, MMRE, VAF | [40]